Abstract

Margin theory provides one of the most popular explanations for the success of AdaBoost, where
the central point lies in the recognition that margin is the key for characterizing the performance
of AdaBoost. This theory has been very influential; e.g., it has been used to argue that AdaBoost
usually does not overfit since it tends to enlarge the margin even after the training error reaches
zero. Previously the minimum margin bound was established for AdaBoost; however, Breiman
[10] pointed out that maximizing the minimum margin does not necessarily lead to better
generalization. Later, Reyzin and Schapire [34] emphasized that the margin distribution rather
than the minimum margin is crucial to the performance of AdaBoost. In this paper, we show that
previous margin bounds are special cases of the kth margin bound, and none of them is really
based on the whole margin distribution. Then, we improve the empirical Bernstein bound given
by Maurer and Pontil [28]. Based on this result, we defend the margin-based explanation against
Breiman's doubt by proving a new generalization error bound that considers exactly the same
factors as Schapire et al. [35] but is uniformly tighter than the bound of Breiman [10]. We also provide
a lower bound for the generalization error of voting classifiers, and by incorporating factors such as
average margin and variance, we present a generalization error bound that is heavily related to
the whole margin distribution. Finally, we provide empirical evidence to verify our theory.

1. Introduction

The AdaBoost algorithm [18, 19], which aims to construct a "strong" classifier by combining
some "weak" learners (slightly better than random guess), has been one of the most influential
classification algorithms, and it has exhibited excellent performance both on benchmark
datasets and real applications.

Many studies are devoted to understanding the mysteries behind the success of AdaBoost, among
which the margin theory proposed by Schapire et al. [35] has been very influential. For example,
AdaBoost often tends to be empirically resistant (but not completely) to overfitting [9, 17, 32],
i.e., the generalization error of the combined learner keeps decreasing as its size becomes very
large and even after the training error has reached zero; this seems to violate Occam's razor,
i.e., the principle that less complex classifiers should perform better. This remains one of the
most famous mysteries of AdaBoost. The margin theory provides the most intuitive and popular
explanation of this mystery, that is: AdaBoost tends to improve the margin even after the error
on the training sample reaches zero.



However, Breiman [10] raised serious doubt on the margin theory by designing arc-gv, a
boosting-style algorithm. This algorithm is able to maximize the minimum margin over the
training data, but its generalization error is high on empirical datasets. Thus, Breiman [10]
concluded that the margin theory for AdaBoost failed. Breiman's argument was backed up with a
minimum margin bound, which is tighter than the generalization bound given by Schapire et al.
[35], and a lot of experiments. Later, Reyzin and Schapire [34] found that there were flaws in the
design of experiments: Breiman used CART trees [12] as base learners and fixed the number of
leaves for controlling the complexity of base learners. However, Reyzin and Schapire [34] found
that the trees produced by arc-gv were usually much deeper than those produced by AdaBoost.
Generally, for two trees with the same number of leaves, the deeper one has a larger complexity
because more judgements are needed for making a prediction. Therefore, Reyzin and Schapire [34]
concluded that Breiman's observation was biased due to the poor control of model complexity.
They repeated the experiments using decision stumps as base learners, considering that a
decision stump makes only a single split and thus has a fixed complexity, and observed that though
arc-gv produced a larger minimum margin, its margin distribution was quite poor. Nowadays,
it is well accepted that the margin distribution is crucial for relating margin to the generalization
performance of AdaBoost. To support the margin theory, Wang et al. [37] presented a tighter
bound in terms of Emargin, which was believed to be relevant to the margin distribution.

In this paper, we show that the minimum margin and Emargin are special cases of the kth margin,
and all the previous margin bounds are single margin bounds that are not really based on the
whole margin distribution. Then, we present a new empirical Bernstein bound, which slightly
improves the bound in [28] but with different proof techniques. Based on this result, we prove a
new generalization error bound for voting classifiers, which considers exactly the same factors as
Schapire et al. [35] but is uniformly tighter than the bounds of Schapire et al. [35] and Breiman
[10]. Therefore, we defend the margin-based explanation against Breiman's doubt. Furthermore,
we present a lower generalization error bound for voting classifiers, and by incorporating other
factors such as average margin and variance, we prove a generalization error bound which is
heavily relevant to the whole margin distribution. Finally, we make comprehensive empirical
comparisons between AdaBoost and arc-gv, and find that AdaBoost has better performance than
but does not absolutely outperform arc-gv, which is consistent with our theory.

The rest of this paper is organized as follows. We begin with some notations and background in
Sections 2 and 3, respectively. Then, we prove the kth margin bound and discuss its relation
to previous bounds in Section 4. Our main results are presented in Section 5 and detailed proofs
are provided in Section 6. We give empirical evidence in Section 7 and conclude this paper in
Section 8.



2. Notations 

Let $\mathcal{X}$ and $\mathcal{Y}$ denote an input space and output space, respectively. For
simplicity, we focus on binary classification problems, i.e., $\mathcal{Y} = \{+1, -1\}$. Denote by
$D$ an (unknown) underlying probability distribution over the product space
$\mathcal{X} \times \mathcal{Y}$. A training sample of size $m$

$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$$

is drawn independently and identically (i.i.d.) according to distribution $D$. We use $\Pr_D[\cdot]$
to refer to the probability with respect to $D$, and $\Pr_S[\cdot]$ to denote the probability with
respect to the uniform distribution over the sample $S$. Similarly, we use $E_D[\cdot]$ and
$E_S[\cdot]$ to denote the expected values, respectively. For an integer $m > 0$, we set
$[m] = \{1, 2, \ldots, m\}$.

The Bernoulli Kullback-Leibler (or KL) divergence is defined as

$$KL(q\|p) = q \log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p} \quad \text{for } 0 \le p, q \le 1.$$

For a fixed $q$, we can easily find that $KL(q\|p)$ is a monotone increasing function for
$q \le p < 1$, and thus, the inverse of $KL(q\|p)$ for the fixed $q$ is given by

$$KL^{-1}(q; u) = \inf_w \{w : w \ge q \text{ and } KL(q\|w) \ge u\}.$$
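Because $KL(q\|w)$ is monotone in $w$ on $[q, 1)$, the inverse can be evaluated numerically by bisection. The following is a minimal sketch under that observation; the function names `bernoulli_kl` and `kl_inverse` and the tolerance parameter are illustrative choices, not part of the original paper.

```python
import math

def bernoulli_kl(q: float, p: float) -> float:
    """KL(q||p) between Bernoulli parameters, using the convention 0*log(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        # The divergence is infinite unless q sits on the same boundary point as p.
        return 0.0 if q == p else math.inf
    kl = 0.0
    if q > 0.0:
        kl += q * math.log(q / p)
    if q < 1.0:
        kl += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return kl

def kl_inverse(q: float, u: float, tol: float = 1e-12) -> float:
    """Approximate inf{w : w >= q and KL(q||w) >= u} by bisection on [q, 1]."""
    if u <= 0.0:
        return q
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(q, mid) >= u:
            hi = mid   # mid already satisfies the constraint, so shrink from above
        else:
            lo = mid
    return hi
```

For $q = 0$ the divergence reduces to $-\log(1 - w)$, which gives a closed form to compare the numerical routine against.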

Let $\mathcal{H}$ be a hypothesis space. Throughout this paper, we restrict $\mathcal{H}$ to be
finite, and similar consideration can be made to the case when $\mathcal{H}$ has finite
VC-dimension. We denote by

$$\mathcal{A} = \left\{ \frac{i}{|\mathcal{H}|} : i \in [|\mathcal{H}|] \right\}.$$

A base learner $h \in \mathcal{H}$ is a function which maps a distribution over
$\mathcal{X} \times \mathcal{Y}$ onto a function $h: \mathcal{X} \to \mathcal{Y}$. Let
$C(\mathcal{H})$ denote the convex hull of $\mathcal{H}$, i.e., a voting classifier
$f \in C(\mathcal{H})$ is of the following form

$$f = \sum_i \alpha_i h_i \quad \text{with } \sum_i \alpha_i = 1 \text{ and } \alpha_i \ge 0.$$

For $N \ge 1$, denote by $C_N(\mathcal{H})$ the set of unweighted averages over $N$ elements from
$\mathcal{H}$, that is

$$C_N(\mathcal{H}) = \left\{ g : g = \frac{1}{N}\sum_{j=1}^{N} h_j, \; h_j \in \mathcal{H} \right\}. \qquad (1)$$

For a voting classifier $f \in C(\mathcal{H})$, we can associate with it a distribution over
$\mathcal{H}$ by using the coefficients $\{\alpha_i\}$, denoted by $Q(f)$. For convenience,
$g \in C_N(\mathcal{H}) \sim Q(f)$ implies $g = \sum_{j=1}^{N} h_j / N$ where $h_j \sim Q(f)$.

For an instance $(x, y)$, the margin with respect to the voting classifier
$f = \sum_i \alpha_i h_i(x)$ is defined as $yf(x)$; in other words,

$$yf(x) = \sum_{i:\, y = h_i(x)} \alpha_i - \sum_{i:\, y \ne h_i(x)} \alpha_i,$$

which shows the difference between the weights of base learners that classify $(x, y)$ correctly and
the weights of base learners that misclassify $(x, y)$. Therefore, margin can be viewed as a measure
of the confidence of the classification. Given a sample
$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, we denote by $\hat{y}_1 f(\hat{x}_1)$ the minimum
margin and by $E_S[yf(x)]$ the average margin, which are defined respectively as follows:

$$\hat{y}_1 f(\hat{x}_1) = \min_{i \in [m]} \{y_i f(x_i)\} \quad \text{and} \quad E_S[yf(x)] = \sum_{i=1}^{m} \frac{y_i f(x_i)}{m}.$$

Algorithm 1 A unified description of AdaBoost and arc-gv

Input: Sample $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ and the number of iterations $T$.
Initialization: $D_1(i) = 1/m$.
for $t = 1$ to $T$ do

1. Construct base learner $h_t: \mathcal{X} \to \mathcal{Y}$ using the distribution $D_t$.

2. Choose $\alpha_t$.

3. Update

$$D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t,$$

where $Z_t$ is a normalization factor (such that $D_{t+1}$ is a distribution).
end for

Output: The final classifier $\mathrm{sgn}[f(x)]$, where

$$f(x) = \sum_{t=1}^{T} \frac{\alpha_t}{\sum_{t=1}^{T} \alpha_t} h_t(x).$$



3. Background 



In the statistical community, great efforts have been devoted to understanding how and why AdaBoost
works. Friedman et al. [20] made an important stride by viewing AdaBoost as a stagewise
optimization and relating it to fitting an additive logistic regression model. Various new
boosting-style algorithms were developed by performing a gradient descent optimization of some
potential loss functions [13, 26, 33]. Based on this optimization view, some boosting-style
algorithms and their variants have been shown to be Bayes consistent under different settings
[3, 11, 22, 39]. However, these theories cannot be used to explain the resistance of AdaBoost to
overfitting, and some statistical views have been questioned seriously by Mease and Wyner
with empirical evidence. In this paper, we focus on the margin theory.



Algorithm 1 provides a unified description of AdaBoost and arc-gv. The only difference between
them lies in the choice of $\alpha_t$. In AdaBoost, $\alpha_t$ is chosen by

$$\alpha_t = \frac{1}{2} \ln \frac{1 + \gamma_t}{1 - \gamma_t},$$

where $\gamma_t = \sum_{i=1}^{m} D_t(i) y_i h_t(x_i)$ is called the edge of $h_t$, which is an affine
transformation of the error rate of $h_t(x)$. However, arc-gv sets $\alpha_t$ in a different way.
Denote by $\rho_t$ the minimum margin of the voting classifier of round $t - 1$, that is,

$$\rho_t = \hat{y}_1 f_t(\hat{x}_1) \quad \text{with } \rho_1 = 0,$$

where

$$f_t = \sum_{s=1}^{t-1} \frac{\alpha_s}{\sum_{s=1}^{t-1} \alpha_s} h_s(x).$$

Then, arc-gv sets $\alpha_t$ to be

$$\alpha_t = \frac{1}{2} \ln \frac{1 + \gamma_t}{1 - \gamma_t} - \frac{1}{2} \ln \frac{1 + \rho_t}{1 - \rho_t}.$$
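The two choices of $\alpha_t$ can be placed side by side in code. The sketch below follows Algorithm 1 on one-dimensional inputs with threshold stumps as base learners; the names `best_stump` and `boost`, the restriction to 1-D stumps, and the clipping constant `eps` (used only to keep the logarithms finite when an edge or minimum margin reaches $\pm 1$) are assumptions made for illustration and do not come from the paper.

```python
import math

def best_stump(xs, ys, dist):
    """Return (edge, threshold, sign) of the 1-D stump with the largest edge under dist."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (+1, -1):
            # The stump predicts `sign` when x >= thr and `-sign` otherwise.
            edge = sum(d * y * (sign if x >= thr else -sign)
                       for x, y, d in zip(xs, ys, dist))
            if best is None or edge > best[0]:
                best = (edge, thr, sign)
    return best

def boost(xs, ys, rounds, variant="adaboost"):
    """Run Algorithm 1 and return the normalized margins y_i f(x_i).

    variant is "adaboost" or "arc-gv"; they differ only in the choice of alpha_t.
    """
    m = len(xs)
    dist = [1.0 / m] * m
    scores = [0.0] * m      # unnormalized sums of alpha_t * y_i * h_t(x_i)
    alpha_sum = 0.0
    eps = 1e-10             # illustrative clipping constant (assumption)
    for _ in range(rounds):
        edge, thr, sign = best_stump(xs, ys, dist)
        edge = min(edge, 1.0 - eps)
        alpha = 0.5 * math.log((1.0 + edge) / (1.0 - edge))
        if variant == "arc-gv":
            # rho_t: minimum margin of the current voting classifier, rho_1 = 0.
            rho = min(scores) / alpha_sum if alpha_sum > 0 else 0.0
            rho = max(-1.0 + eps, min(rho, 1.0 - eps))
            alpha -= 0.5 * math.log((1.0 + rho) / (1.0 - rho))
        if alpha <= 0.0:
            break           # no further progress is possible with this base class
        preds = [sign if x >= thr else -sign for x in xs]
        for i in range(m):
            scores[i] += alpha * ys[i] * preds[i]
            dist[i] *= math.exp(-alpha * ys[i] * preds[i])
        z = sum(dist)       # normalization factor Z_t
        dist = [d / z for d in dist]
        alpha_sum += alpha
    if alpha_sum == 0.0:
        return scores
    return [s / alpha_sum for s in scores]
```

Calling `boost(xs, ys, T)` and `boost(xs, ys, T, variant="arc-gv")` on the same sample returns two margin vectors whose minimum and whole distribution can then be compared directly.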



Schapire et al. [35] first proposed the margin theory for AdaBoost and upper bounded the
generalization error as follows:

Theorem 1 [35] For any $\delta > 0$ and $\theta > 0$, with probability at least $1 - \delta$ over the
random choice of sample $S$ with size $m$, every voting classifier $f$ satisfies the following bound:

$$\Pr_D[yf(x) < 0] \le \Pr_S[yf(x) \le \theta] + O\left( \frac{1}{\sqrt{m}} \left( \frac{\ln m \ln |\mathcal{H}|}{\theta^2} + \ln \frac{1}{\delta} \right)^{1/2} \right).$$



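Breiman [10] provided the minimum margin bound for arc-gv by Theorem 2 with our notations.

Theorem 2 [10] If

$$\theta = \hat{y}_1 f(\hat{x}_1) > 4\sqrt{\frac{2}{|\mathcal{H}|}} \quad \text{and} \quad R = \frac{32 \ln(2|\mathcal{H}|)}{m\theta^2} \le 2m,$$

then, for any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of sample $S$
with size $m$, every voting classifier $f$ satisfies the following bound:

$$\Pr_D[yf(x) < 0] \le R\left( \ln(2m) + \ln\frac{1}{R} + 1 \right) + \frac{1}{m} \ln\frac{|\mathcal{H}|}{\delta}.$$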

Empirical results show that arc-gv probably generates a larger minimum margin but with higher
generalization error, and Breiman's bound is $O(\ln m / m)$, tighter than the
$O(\sqrt{\ln m / m})$ bound in Theorem 1. Thus, Breiman cast serious doubt on margin theory. To
support the margin theory, Wang et al. [37] presented a tighter bound in terms of Emargin by
Theorem 3, which was believed to be related to the margin distribution. Notice that the factors
considered by Wang et al. [37] are different from those considered by Schapire et al. [35] and
Breiman [10].



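Theorem 3 [37] For any $\delta > 0$, with probability at least $1 - \delta$ over the random choice
of the sample $S$ with size $m$, every voting classifier $f$ satisfies the following bound:

$$\Pr_D[yf(x) < 0] \le \frac{\ln|\mathcal{H}|}{m} + \inf_{q \in \{0, 1/m, \ldots, 1\}} KL^{-1}\left(q; u[\hat{\theta}(q)]\right),$$

where

$$u[\hat{\theta}(q)] = \frac{1}{m}\left( \frac{8 \ln(2|\mathcal{H}|)}{\hat{\theta}^2(q)} \ln\frac{2m^2}{\ln|\mathcal{H}|} + \ln|\mathcal{H}| + \ln\frac{m}{\delta} \right)$$

and $\hat{\theta}(q) = \sup\{\theta \in (\sqrt{8/|\mathcal{H}|}, 1] : \Pr_S[yf(x) \le \theta] \le q\}$.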



Instead of the whole function space, much work developed margin-based data-dependent bounds
for generalization error, e.g., empirical cover number [36], empirical fat-shattering dimension,
and Rademacher and Gaussian complexities. Some of these bounds are proven to be sharper
than Theorem 1, but it is difficult, or even impossible, to directly show that these bounds are
sharper than the minimum margin bound of Theorem 2, and they fail to explain the resistance of
AdaBoost to overfitting.



4. None Margin Distribution Bound 

Given a sample $S$ of size $m$, we define the kth margin $\hat{y}_k f(\hat{x}_k)$ as the kth smallest
margin over sample $S$, i.e., the kth smallest value in $\{y_i f(x_i), i \in [m]\}$. The following
theorem shows that the kth margin can be used to measure the performance of a voting classifier;
its proof is deferred to Section 6.1.
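As a small illustration of the definition, the kth margin is an order statistic of the sample margins, with $k = 1$ recovering the minimum margin. The helper names `kth_margin` and `average_margin` and the sample values below are hypothetical.

```python
def kth_margin(margins, k):
    """k-th smallest margin over the sample (k is 1-indexed; k=1 is the minimum margin)."""
    if not 1 <= k <= len(margins):
        raise ValueError("k must lie in [1, m]")
    return sorted(margins)[k - 1]

def average_margin(margins):
    """Average margin E_S[y f(x)] over the sample."""
    return sum(margins) / len(margins)

# Hypothetical margins y_i f(x_i) of a voting classifier on four examples.
example_margins = [0.4, -0.1, 0.9, 0.2]
minimum = kth_margin(example_margins, 1)
third = kth_margin(example_margins, 3)
```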

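Theorem 4 For any $\delta > 0$ and $k \in [m]$, if $\theta = \hat{y}_k f(\hat{x}_k) > \sqrt{8/|\mathcal{H}|}$,
then with probability at least $1 - \delta$ over the random choice of sample $S$ with size $m$, every
voting classifier $f$ satisfies the following bound:

$$\Pr_D[yf(x) < 0] \le \frac{\ln|\mathcal{H}|}{m} + KL^{-1}\left( \frac{k-1}{m}; \frac{q}{m} \right), \qquad (2)$$

where

$$q = \frac{8 \ln(2|\mathcal{H}|)}{\theta^2} \ln\frac{2m^2}{\ln|\mathcal{H}|} + \ln|\mathcal{H}| + \ln\frac{m}{\delta}.$$

Especially, when $k$ is constant with $m > 4k$, we have

$$\Pr_D[yf(x) < 0] \le \frac{\ln|\mathcal{H}|}{m} + \frac{2}{m}\left( \frac{8 \ln(2|\mathcal{H}|)}{\theta^2} \ln\frac{2m^2}{\ln|\mathcal{H}|} + \ln|\mathcal{H}| + \ln\frac{k m^{k-1}}{\delta} \right). \qquad (3)$$

It is interesting to study the relation between Theorem 4 and previous results, especially
Theorems 2 and 3. It is straightforward to get a result similar to Breiman's minimum margin
bound in Theorem 2 by setting $k = 1$ in Eqn. (3):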



Corollary 1 For any $\delta > 0$, if $\theta = \hat{y}_1 f(\hat{x}_1) > \sqrt{8/|\mathcal{H}|}$, then with
probability at least $1 - \delta$ over the random choice of sample $S$ with size $m$, every voting
classifier $f$ satisfies the following bound:

$$\Pr_D[yf(x) < 0] \le \frac{\ln|\mathcal{H}|}{m} + \frac{2}{m}\left( \frac{8 \ln(2|\mathcal{H}|)}{\theta^2} \ln\frac{2m^2}{\ln|\mathcal{H}|} + \ln\frac{|\mathcal{H}|}{\delta} \right).$$

Notice that when $k$ is a constant, the bound in Eqn. (3) is $O(\ln m / m)$ and the only difference
lies in the coefficient. Thus, there is no essential difference in selecting a constant kth margin
(such as the 2nd margin, the 3rd margin, etc.) to measure the confidence of classification for
large-size samples.

Based on Theorem 4, it is also not difficult to get a result similar to the Emargin bound in
Theorem 3 as follows:

Corollary 2 For any $\delta > 0$, if $\theta_k = \hat{y}_k f(\hat{x}_k) > \sqrt{8/|\mathcal{H}|}$, then with
probability at least $1 - \delta$ over the random choice of the sample $S$ with size $m$, every voting
classifier $f$ satisfies the following bound:

$$\Pr_D[yf(x) < 0] \le \frac{\ln|\mathcal{H}|}{m} + \inf_{k \in [m]} KL^{-1}\left( \frac{k-1}{m}; \frac{q}{m} \right),$$

where

$$q = \frac{8 \ln(2|\mathcal{H}|)}{\theta_k^2} \ln\frac{2m^2}{\ln|\mathcal{H}|} + \ln|\mathcal{H}| + \ln\frac{m}{\delta}.$$

From Corollary 2, we can easily understand that the Emargin bound ought to be tighter than
the minimum margin bound because the former takes the infimum over $k \in [m]$ while the
latter focuses only on the minimum margin.

In summary, the preceding analysis reveals that both the minimum margin and Emargin are
special cases of the kth margin; neither of them succeeds in relating the margin distribution to the
generalization performance of AdaBoost.
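To see how the constant-$k$ bound behaves numerically, the right-hand side of Eqn. (3) can be evaluated directly. This is a minimal sketch; the function name `constant_k_margin_bound`, its argument names, and the numbers in the usage lines are illustrative assumptions rather than values from the paper.

```python
import math

def constant_k_margin_bound(theta, k, m, h_size, delta):
    """Right-hand side of Eqn. (3), where theta is the k-th margin and h_size = |H|."""
    if theta <= math.sqrt(8.0 / h_size):
        raise ValueError("Theorem 4 requires theta > sqrt(8/|H|)")
    if m <= 4 * k:
        raise ValueError("Eqn. (3) requires m > 4k")
    log_h = math.log(h_size)
    complexity = (8.0 * math.log(2 * h_size) / theta ** 2
                  * math.log(2 * m ** 2 / log_h))
    # ln(k * m^(k-1) / delta), expanded so that m^(k-1) is never formed explicitly.
    confidence = math.log(k) + (k - 1) * math.log(m) - math.log(delta)
    return log_h / m + 2.0 / m * (complexity + log_h + confidence)

# Hypothetical setting: the same margin value used as the 1st, 2nd and 3rd margin.
bounds = [constant_k_margin_bound(0.2, k, 10 ** 6, 10 ** 4, 0.05) for k in (1, 2, 3)]
```

With the margin value held fixed, increasing $k$ only adds terms of order $\ln m / m$, which is the sense in which the choice among constant $k$ changes the coefficient but not the rate.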



5. Main Results 



We begin with the following empirical Bernstein bound, which is crucial for our main theorems:

Theorem 5 For any $\delta > 0$, and for i.i.d. random variables $Z, Z_1, Z_2, \ldots, Z_m$ with
$Z \in [0, 1]$ and $m \ge 4$, the following hold with probability at least $1 - \delta$:

$$E[Z] - \frac{1}{m}\sum_{i=1}^{m} Z_i \le \sqrt{\frac{2 \hat{V}_m \ln(2/\delta)}{m}} + \frac{7 \ln(2/\delta)}{3m}, \qquad (4)$$

$$E[Z] - \frac{1}{m}\sum_{i=1}^{m} Z_i \ge -\sqrt{\frac{2 \hat{V}_m \ln(2/\delta)}{m}} - \frac{7 \ln(2/\delta)}{3m}, \qquad (5)$$

where $\hat{V}_m = \sum_{i \ne j} (Z_i - Z_j)^2 / 2m(m-1)$.

It is noteworthy that the bound in Eqn. (4) is similar to but slightly improves the bound of
Maurer and Pontil [28, Theorem 4], and we also present a lower bound as shown in Eqn. (5).
The proof is deferred to Section 6.2; it is simple, straightforward and different from that of
Maurer and Pontil [28].
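The deviation term shared by Eqns. (4) and (5) is easy to evaluate from a sample. The sketch below uses the identity $\sum_{i \ne j}(Z_i - Z_j)^2 = 2m \sum_i (Z_i - \bar{Z})^2$, so that $\hat{V}_m$ coincides with the usual unbiased sample variance; the function names are illustrative and not taken from the paper.

```python
import math

def sample_variance(zs):
    """V_m = sum_{i != j} (z_i - z_j)^2 / (2 m (m - 1)), computed in linear time."""
    m = len(zs)
    mean = sum(zs) / m
    return sum((z - mean) ** 2 for z in zs) / (m - 1)

def empirical_bernstein_radius(zs, delta):
    """Deviation term sqrt(2 V_m ln(2/delta) / m) + 7 ln(2/delta) / (3 m)."""
    m = len(zs)
    if m < 4:
        raise ValueError("the bound is stated for m >= 4")
    log_term = math.log(2.0 / delta)
    return (math.sqrt(2.0 * sample_variance(zs) * log_term / m)
            + 7.0 * log_term / (3.0 * m))
```

When the sample variance is zero, only the $7\ln(2/\delta)/(3m)$ term remains, which illustrates why a variance-sensitive bound can decay at rate $1/m$ rather than $1/\sqrt{m}$.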



We now present our first main theorem:

Theorem 6 For any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of
sample $S$ with size $m \ge 4$, every voting classifier $f$ satisfies the following bound:

$$\Pr_D[yf(x) < 0] \le \frac{2}{m} + \inf_{\theta \in (0,1]} \left[ \Pr_S[yf(x) < \theta] + \frac{7\mu + 3\sqrt{2\mu}}{3m} + \sqrt{\frac{2\mu}{m} \Pr_S[yf(x) < \theta]} \right],$$

where

$$\mu = \frac{8}{\theta^2} \ln m \ln(2|\mathcal{H}|) + \ln\frac{2|\mathcal{H}|}{\delta}.$$

This proof is based on the techniques developed by Schapire et al. [35], and the main difference
is that we utilize the empirical Bernstein bound of Eqn. (4) in Theorem 5 for the derivation of
generalization error. The detailed proof is deferred to Section 6.3.

It is noteworthy that Theorem 6 shows that the generalization error can be bounded in terms of
the empirical margin distribution $\Pr_S[yf(x) < \theta]$, the training sample size and the hypothesis
complexity; in other words, this bound considers exactly the same factors as Schapire et al. [35]
in Theorem 1. However, the following corollary shows that the bound in Theorem 6 is tighter
than the bound of Schapire et al. [35] in Theorem 1, as well as the minimum margin bound of
Breiman [10] in Theorem 2.
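Since the bound of Theorem 6 holds simultaneously for every $\theta \in (0, 1]$, evaluating the bracketed expression on any finite grid of $\theta$ values and taking the minimum still yields a valid upper bound. The sketch below does this with the constants as stated in the theorem; the function name `theorem6_bound`, the default grid, and the argument names are assumptions for illustration.

```python
import math

def theorem6_bound(margins, h_size, delta, thetas=None):
    """Evaluate the Theorem 6 bound over a grid of theta values (h_size = |H|)."""
    m = len(margins)
    if m < 4:
        raise ValueError("the theorem is stated for m >= 4")
    if thetas is None:
        thetas = [i / 100 for i in range(1, 101)]   # illustrative grid on (0, 1]
    best = math.inf
    for theta in thetas:
        mu = (8.0 / theta ** 2 * math.log(m) * math.log(2 * h_size)
              + math.log(2 * h_size / delta))
        # Empirical margin distribution Pr_S[y f(x) < theta].
        frac = sum(1 for v in margins if v < theta) / m
        value = (frac
                 + (7.0 * mu + 3.0 * math.sqrt(2.0 * mu)) / (3.0 * m)
                 + math.sqrt(2.0 * mu * frac / m))
        best = min(best, value)
    return 2.0 / m + best
```

The trade-off is visible in the two terms: a larger $\theta$ shrinks $\mu$ but can only increase $\Pr_S[yf(x) < \theta]$, so the minimizing $\theta$ depends on the whole empirical margin distribution.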



Corollary 3 For any $\delta > 0$, if the minimum margin $\theta_1 = \hat{y}_1 f(\hat{x}_1) > 0$ and
$m \ge 4$, then we have

$$\inf_{\theta \in (0,1]} \left[ \Pr_S[yf(x) < \theta] + \frac{7\mu + 3\sqrt{2\mu}}{3m} + \sqrt{\frac{2\mu}{m} \Pr_S[yf(x) < \theta]} \right] \le \frac{7\mu_1 + 3\sqrt{2\mu_1}}{3m}, \qquad (6)$$

where $\mu = 8 \ln m \ln(2|\mathcal{H}|)/\theta^2 + \ln(2|\mathcal{H}|/\delta)$ and
$\mu_1 = 8 \ln m \ln(2|\mathcal{H}|)/\theta_1^2 + \ln(2|\mathcal{H}|/\delta)$; moreover, if the following hold

$$\theta_1 = \hat{y}_1 f(\hat{x}_1) > 4\sqrt{\frac{2}{|\mathcal{H}|}}, \qquad (7)$$

$$R = \frac{32 \ln(2|\mathcal{H}|)}{m\theta_1^2} \le 2m, \qquad (8)$$

m > max {4, exp (je^^ In ^) } ,    (9)

then we have

$$\frac{2}{m} + \inf_{\theta \in (0,1]} \left[ \Pr_S[yf(x) < \theta] + \frac{7\mu + 3\sqrt{2\mu}}{3m} + \sqrt{\frac{2\mu}{m} \Pr_S[yf(x) < \theta]} \right] \le R\left( \ln(2m) + \ln\frac{1}{R} + 1 \right) + \frac{1}{m} \ln\frac{|\mathcal{H}|}{\delta}. \qquad (10)$$

This proof is deferred to Section 6.4. From Eqn. (6), we can see clearly that the bound of
Theorem 6 is $O(\ln m / m)$, uniformly tighter than the bound of Schapire et al. [35] in Theorem 1.
In fact, we could also guarantee that the bound of Theorem 6 is $O(\ln m / m)$ even under the weaker
condition that $\hat{y}_k f(\hat{x}_k) > 0$ for some $k \le O(\ln m)$. It is also noteworthy that Eqns. (7)
and (8) are used here to guarantee the conditions of Theorem 2, and Eqn. (10) shows that the bound of
Theorem 6 is tighter than Breiman's minimum margin bound of Theorem 2 for large-size samples.

Breiman [10] doubted the margin theory because of two recognitions: i) the minimum margin
bound of Breiman [10] is tighter than the margin distribution bound of Schapire et al. [35], and
therefore, the minimum margin is more essential than the margin distribution to characterize the
generalization performance; ii) arc-gv maximizes the minimum margin, but demonstrates worse
performance than AdaBoost empirically. However, our result shows that the margin distribution
bound in Theorem 1 can be greatly improved so that it is tighter than the minimum margin bound,
and therefore, it is natural that AdaBoost outperforms arc-gv empirically on some datasets; in
a word, our results provide a complete answer to Breiman's doubt on margin theory.



We can also give a lower bound for generalization error as follows:

Theorem 7 For any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of
sample $S$ with size $m \ge 4$, every voting classifier $f$ satisfies the following bound:

$$\Pr_D[yf(x) < 0] \ge \sup_{\theta \in (0,1]} \left[ \Pr_S[yf(x) < -\theta] - \sqrt{\frac{2\mu}{m} \Pr_S[yf(x) < 0]} - \frac{7\mu + 3\sqrt{2\mu}}{3m} \right] - \frac{2}{m},$$

where $\mu = 8 \ln m \ln(2|\mathcal{H}|)/\theta^2 + \ln(2|\mathcal{H}|/\delta)$.

The proof is based on Eqn. (5) in Theorem 5 and we defer it to Section 6.5. We now introduce
the second main result as follows:

Theorem 8 For any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of
sample $S$ with size $m \ge 4$, every voting classifier $f$ satisfies the following bound:

Friyfix) < 0] < ^ + inf 



2//^ / -21nm 

^X(fyj + exp^^^_^2[y^(^)]+^/9) 



where $\mu = 144 \ln m \ln(2|\mathcal{H}|)/\theta^2 + \ln(2|\mathcal{H}|/\delta)$ and
$\hat{I}(\theta) = \Pr_S[yf(x) < \theta] \Pr_S[yf(x) \ge 2\theta/3]$.

It is easy to find in almost all boosting experiments that the average margin $E_S[yf(x)]$ is positive.
Thus, the bound of Theorem 8 can be tighter when we enlarge the average margin. The statistic
$\hat{I}(\cdot)$ reflects the margin variance in some sense, and the term including $\hat{I}(\cdot)$
could be small or even vanish except for a small interval when the variance is small. Similarly to the
proof of Eqn. (6), we can show that the bound of Theorem 8 is still $O(\ln m / m)$.

Figure 1: Each curve represents a voting classifier. The X-axis and Y-axis denote instance and margin, respectively,
and uniform distribution is assumed on the instance space. The voting classifiers $h_1$, $h_2$ and $h_3$ have the same
average margin but with different generalization error rates: 1/2, 1/3 and 0.

Theorem 8 provides theoretical support for the suggestion of Reyzin and Schapire [34], that is,
the average margin can be used to measure the performance. It is noteworthy, however, that
merely considering the average margin is insufficient to bound the generalization error tightly, as
shown by the simple example in Figure 1. Indeed, "average" and "variance" are two important
statistics for capturing a distribution, and thus, it is reasonable that both the average margin
and margin variance are considered in Theorem 8.



6. Proofs 



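In this section, we provide the detailed proofs for the main theorems and corollaries, and we
begin with a series of useful lemmas as follows:

Lemma 1 (Chernoff bound [15]) Let $X, X_1, X_2, \ldots, X_m$ be i.i.d. random variables with
$X \in [0, 1]$. Then, the following hold for any $\epsilon > 0$,

$$\Pr\left[ \frac{1}{m}\sum_{i=1}^{m} X_i \ge E[X] + \epsilon \right] \le \exp\left( -\frac{m\epsilon^2}{2} \right),$$

$$\Pr\left[ \frac{1}{m}\sum_{i=1}^{m} X_i \le E[X] - \epsilon \right] \le \exp\left( -\frac{m\epsilon^2}{2} \right).$$

Lemma 2 (Relative entropy Chernoff bound [21]) The following holds for $0 < \epsilon < 1$:

$$\sum_{i=0}^{k-1} \binom{m}{i} \epsilon^i (1 - \epsilon)^{m-i} \le \exp\left( -m\, KL\left( \frac{k-1}{m} \Big\| \epsilon \right) \right).$$

Lemma 3 (Bernstein inequalities [6]) Let $X, X_1, X_2, \ldots, X_m$ be i.i.d. random variables with
$X_i \in [0, 1]$. Then, for any $\delta > 0$, the following hold with probability at least $1 - \delta$,

$$E[X] - \frac{1}{m}\sum_{i=1}^{m} X_i \le \sqrt{\frac{2 V(X) \ln(1/\delta)}{m}} + \frac{\ln(1/\delta)}{3m}, \qquad (11)$$

$$E[X] - \frac{1}{m}\sum_{i=1}^{m} X_i \ge -\sqrt{\frac{2 V(X) \ln(1/\delta)}{m}} - \frac{\ln(1/\delta)}{3m}, \qquad (12)$$

where $V(X)$ denotes the variance $E[(X - E[X])^2]$.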

6.1. Proof of Theorem 4

We begin with a lemma as follows:

Lemma 4 Let $f \in C(\mathcal{H})$ and let $g \in C_N(\mathcal{H})$ be chosen i.i.d. according to
distribution $Q(f)$. If $\hat{y}_k f(\hat{x}_k) \ge \theta$ and $\hat{y}_k g(\hat{x}_k) \le \alpha$ with
$\theta > \alpha$, then there is an instance $(x_i, y_i)$ in $S$ such that $y_i f(x_i) \ge \theta$ and
$y_i g(x_i) \le \alpha$.

Proof: There exists a bijection between $\{y_j f(x_j) : j \in [m]\}$ and $\{y_j g(x_j) : j \in [m]\}$
according to the original position in $S$. Suppose $\hat{y}_k f(\hat{x}_k)$ corresponds to
$\hat{y}_l g(\hat{x}_l)$ for some $l$. If $l \le k$ then the example $(\hat{x}_k, \hat{y}_k)$ of
$\hat{y}_k f(\hat{x}_k)$ is desired; otherwise, except for $(\hat{x}_k, \hat{y}_k)$ of
$\hat{y}_k f(\hat{x}_k)$ in $S$, there are at least $m - k$ elements larger than or equal to $\theta$ in
$\{y_j f(x_j) : j \in [m] \setminus \{k\}\}$ but at most $m - k - 1$ elements larger than $\alpha$ in
$\{y_j g(x_j) : j \in [m] \setminus \{l\}\}$. This completes the proof from the bijection. □



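Proof of Theorem 4: For every $f \in C(\mathcal{H})$, we can construct a $g \in C_N(\mathcal{H})$ by
choosing $N$ elements i.i.d. according to distribution $Q(f)$, and thus $E_{g \sim Q(f)}[g] = f$. For
$\alpha > 0$, the Chernoff bound in Lemma 1 gives

$$\Pr_D[yf(x) < 0] = \Pr_{D, Q(f)}[yf(x) < 0, yg(x) \ge \alpha] + \Pr_{D, Q(f)}[yf(x) < 0, yg(x) < \alpha] \le \exp(-N\alpha^2/2) + \Pr_{D, Q(f)}[yg(x) < \alpha]. \qquad (13)$$

For any $\epsilon_N > 0$, we consider the following probability:

$$\Pr_{S \sim D^m}\left[ \Pr_D[yg(x) < \alpha] > I[\hat{y}_k g(\hat{x}_k) \le \alpha] + \epsilon_N \right] \le \Pr_{S \sim D^m}\left[ \hat{y}_k g(\hat{x}_k) > \alpha, \; \Pr_D[yg(x) < \alpha] > \epsilon_N \right] \le \sum_{i=0}^{k-1} \binom{m}{i} \epsilon_N^i (1 - \epsilon_N)^{m-i}, \qquad (14)$$

where $\hat{y}_k g(\hat{x}_k)$ denotes the kth margin with respect to $g$. For any $k$, Eqn. (14) can be
bounded by $\exp(-m\, KL(\frac{k-1}{m} \| \epsilon_N))$ from Lemma 2; for constant $k$ with $m > 4k$,
we have

$$\sum_{i=0}^{k-1} \binom{m}{i} \epsilon_N^i (1 - \epsilon_N)^{m-i} \le k (1 - \epsilon_N)^{m/2} \binom{m}{k-1} \le k m^{k-1} (1 - \epsilon_N)^{m/2}.$$

By using the union bound and $|C_N(\mathcal{H})| \le |\mathcal{H}|^N$, we have, for any $k \in [m]$,

$$\Pr_{S \sim D^m}\left[ \exists g \in C_N(\mathcal{H}), \exists \alpha \in \mathcal{A}, \; \Pr_D[yg(x) < \alpha] > I[\hat{y}_k g(\hat{x}_k) \le \alpha] + \epsilon_N \right] \le |\mathcal{H}|^{N+1} \exp\left( -m\, KL\left( \frac{k-1}{m} \Big\| \epsilon_N \right) \right).$$

Setting $\delta_N = |\mathcal{H}|^{N+1} \exp(-m\, KL(\frac{k-1}{m} \| \epsilon_N))$ gives
$\epsilon_N = KL^{-1}\left( \frac{k-1}{m}; \frac{1}{m} \ln\frac{|\mathcal{H}|^{N+1}}{\delta_N} \right)$.
Thus, with probability at least $1 - \delta_N$ over sample $S$, for all $f \in C(\mathcal{H})$ and all
$\alpha \in \mathcal{A}$, we have

$$\Pr_D[yg(x) < \alpha] \le I[\hat{y}_k g(\hat{x}_k) \le \alpha] + KL^{-1}\left( \frac{k-1}{m}; \frac{1}{m} \ln\frac{|\mathcal{H}|^{N+1}}{\delta_N} \right). \qquad (15)$$

Similarly, for constant $k$, with probability at least $1 - \delta_N$ over sample $S$, it holds that

$$\Pr_D[yg(x) < \alpha] \le I[\hat{y}_k g(\hat{x}_k) \le \alpha] + \frac{2}{m} \ln\frac{k m^{k-1} |\mathcal{H}|^{N+1}}{\delta_N}. \qquad (16)$$

From $E_{g \sim Q(f)}[I[\hat{y}_k g(\hat{x}_k) \le \alpha]] = \Pr_{g \sim Q(f)}[\hat{y}_k g(\hat{x}_k) \le \alpha]$,
we have, for any $\theta > \alpha$,

$$\Pr_{g \sim Q(f)}[\hat{y}_k g(\hat{x}_k) \le \alpha] \le I[\hat{y}_k f(\hat{x}_k) < \theta] + \Pr_{g \sim Q(f)}[\hat{y}_k f(\hat{x}_k) \ge \theta, \hat{y}_k g(\hat{x}_k) \le \alpha]. \qquad (17)$$

Notice that the instance $(\hat{x}_k, \hat{y}_k)$ in $\{\hat{y}_i f(\hat{x}_i)\}$ may be different from the
instance $(\hat{x}_k, \hat{y}_k)$ in $\{\hat{y}_i g(\hat{x}_i)\}$, but from Lemma 4, the last term on the
right-hand side of Eqn. (17) can be further bounded by

$$\Pr_{g \sim Q(f)}[\exists (x_i, y_i) \in S : y_i f(x_i) \ge \theta, y_i g(x_i) \le \alpha] \le m \exp(-N(\theta - \alpha)^2/2). \qquad (18)$$

Combining Eqns. (13), (15), (17) and (18), we have that with probability at least $1 - \delta_N$ over the
sample $S$, for all $f \in C(\mathcal{H})$, all $\theta > \alpha$, all $k \in [m]$ but fixed $N$:

$$\Pr_D[yf(x) < 0] \le I[\hat{y}_k f(\hat{x}_k) < \theta] + m \exp(-N(\theta - \alpha)^2/2) + \exp(-N\alpha^2/2) + KL^{-1}\left( \frac{k-1}{m}; \frac{1}{m} \ln\frac{|\mathcal{H}|^{N+1}}{\delta_N} \right). \qquad (19)$$

To obtain the probability of failure for any $N$ at most $\delta$, we select $\delta_N = \delta/2^N$.
Setting $\alpha = \frac{\theta}{2} - \frac{\eta}{|\mathcal{H}|} \in \mathcal{A}$ and
$N = \frac{8}{\theta^2} \ln\frac{2m^2}{\ln|\mathcal{H}|}$ with $0 \le \eta < 1$, we have

$$\exp(-N\alpha^2/2) + m \exp(-N(\theta - \alpha)^2/2) \le 2m \exp(-N\theta^2/8) \le \ln|\mathcal{H}|/m$$

from the fact $2m > \exp(N/(2|\mathcal{H}|))$ for $\theta > \sqrt{8/|\mathcal{H}|}$. Finally we obtain

$$\Pr_D[yf(x) < 0] \le I[\hat{y}_k f(\hat{x}_k) < \theta] + \frac{\ln|\mathcal{H}|}{m} + KL^{-1}\left( \frac{k-1}{m}; \frac{q}{m} \right),$$

where $q = \frac{8 \ln(2|\mathcal{H}|)}{\theta^2} \ln\frac{2m^2}{\ln|\mathcal{H}|} + \ln|\mathcal{H}| + \ln\frac{m}{\delta}$.
This completes the proof of Eqn. (2). In a similar manner, we have

$$\Pr_D[yf(x) < 0] \le I[\hat{y}_k f(\hat{x}_k) < \theta] + \frac{\ln|\mathcal{H}|}{m} + \frac{2}{m}\left( \frac{8 \ln(2|\mathcal{H}|)}{\theta^2} \ln\frac{2m^2}{\ln|\mathcal{H}|} + \ln|\mathcal{H}| + \ln\frac{k m^{k-1}}{\delta} \right),$$

for constant $k$ with $m > 4k$. This completes the proof of Eqn. (3) as desired. □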



6.2. Proof of Theorem 5

For notational simplicity, we denote by $X = (X_1, X_2, \ldots, X_m)$ a vector of $m$ i.i.d. random
variables, and further set $X^{k,Y} = (X_1, \ldots, X_{k-1}, Y, X_{k+1}, \ldots, X_m)$, i.e., the vector
with the kth variable $X_k$ in $X$ replaced by variable $Y$. We first introduce some lemmas as follows:

Lemma 5 (McDiarmid formula [29]) Let $X = (X_1, X_2, \ldots, X_m)$ be a vector of $m$ i.i.d. random
variables taking values in a set $A$. For any $k \in [m]$ and $Y \in A$, if
$|F(X) - F(X^{k,Y})| \le c_k$ for $F: A^m \to \mathbb{R}$, then the following holds for any $t > 0$,

$$\Pr[F(X) - E[F(X)] \ge t] \le \exp\left( \frac{-2t^2}{\sum_{k=1}^{m} c_k^2} \right).$$

Lemma 6 (Theorem 13 [27]) Let $X = (X_1, X_2, \ldots, X_m)$ be a vector of $m$ i.i.d. random variables
taking values in a set $A$. If $F: A^m \to \mathbb{R}$ satisfies that

$$F(X) - \inf_{Y \in A} F(X^{k,Y}) \le 1 \quad \text{and} \quad \sum_{k=1}^{m} \left( F(X) - \inf_{Y \in A} F(X^{k,Y}) \right)^2 \le F(X),$$

then the following holds for any $t > 0$,

$$\Pr[E[F(X)] - F(X) \ge t] \le \exp(-t^2 / 2E[F(X)]).$$

Lemma 7 For two i.i.d. random variables $X$ and $Y$, we have

$$E[(X - Y)^2] = 2E[(X - E[X])^2] = 2V(X).$$

Proof: This lemma follows from the obvious fact
$E[(X - Y)^2] = E(X^2 + Y^2 - 2XY) = 2E[X^2] - 2E^2[X] = 2E[(X - E[X])^2]$. □

Theorem 9 Let $X = (X_1, X_2, \ldots, X_m)$ be a vector of $m \ge 4$ i.i.d. random variables with values
in $[0, 1]$, and we denote by

$$V_m(X) = \frac{1}{2m(m-1)} \sum_{i \ne j} (X_i - X_j)^2.$$

Then for any $\delta > 0$, we have

$$\Pr\left[ \sqrt{E[V_m(X)]} < \sqrt{V_m(X)} - \sqrt{\frac{\ln(1/\delta)}{16m}} \right] \le \delta, \qquad (20)$$

$$\Pr\left[ \sqrt{E[V_m(X)]} > \sqrt{V_m(X)} + \sqrt{\frac{2\ln(1/\delta)}{m}} \right] \le \delta. \qquad (21)$$

The bounds in this theorem are tighter than the bounds of [28, Theorem 10], in particular for
Eqn. (20). Moreover, our proof is simple, direct and different from the work of Maurer and Pontil.



Proof of Theorem 9: We will utilize Lemmas 5 and 6 to prove Eqns. (20) and (21), respectively.
For Eqn. (20), we first observe that, for any $k \in [m]$,

$$\left| \sqrt{V_m(X)} - \sqrt{V_m(X^{k,Y})} \right| = \frac{\left| V_m(X) - V_m(X^{k,Y}) \right|}{\sqrt{V_m(X)} + \sqrt{V_m(X^{k,Y})}} \le \frac{1}{2\sqrt{2}\, m},$$

where we use $V_m(X), V_m(X^{k,Y}) \le 1/2$ from $X_i \in [0, 1]$. By using Jensen's inequality, we have
$E[\sqrt{V_m(X)}] \le \sqrt{E[V_m(X)]}$ and thus,

$$\Pr\left[ \sqrt{E[V_m(X)]} < \sqrt{V_m(X)} - \epsilon \right] \le \Pr\left[ E\left[\sqrt{V_m(X)}\right] < \sqrt{V_m(X)} - \epsilon \right] \le \exp(-16m\epsilon^2),$$

where the last inequality holds by applying the McDiarmid formula in Lemma 5 to $\sqrt{V_m}$. Therefore,
we complete the proof of Eqn. (20) by setting $\delta = \exp(-16m\epsilon^2)$.



For Eqn. (21), we set $\xi_m(X) = mV_m(X)$. For $X_i \in [0, 1]$ and $\xi_m(X^{k,Y})$, it is easy to
obtain the optimal solution by simple calculation

$$Y^* = \arg\inf_{Y \in [0,1]} [\xi_m(X^{k,Y})] = \sum_{i \ne k} \frac{X_i}{m - 1},$$

which yields that

$$\xi_m(X) - \inf_{Y \in [0,1]} [\xi_m(X^{k,Y})] = \frac{1}{m-1} \sum_{i \ne k} \left[ (X_i - X_k)^2 - (Y^* - X_i)^2 \right] = (X_k - Y^*)^2.$$

For $X_i \in [0, 1]$, it is obvious that

$$\xi_m(X) - \inf_{Y \in [0,1]} [\xi_m(X^{k,Y})] \le 1,$$
y6[o,i] 



and we further have 

k=l ' fc=l i^k 

(m-l)4m^V ^ m) - im-lY\m^\ ^ m ) j ^ ' 

^ ' k=\ i=\ ^ ' \ k=\ i=\ / 

where we use the Jenson's inequahty E\a^\ < E'^[a'^]. From Lemma [71 we have 

ly(x,-y^y<^ Yix, - Xkf = ^ Yix, - Xkf. 

k=l 1=1 i,k i^k 

Substituting the above inequality into Eqn. (j22p . we have 



3/1 \ 

^— ' Ye[o,i] 4(m — ly \ m{m — 1) ^-^ j 

k=l \ iy^k j 



< 1 Vly X 1^ 



2 

m 



2(m - 1)2 

where the second inequality holds from $\sum_{i \ne k} (X_i - X_k)^2 / m(m-1) \le 1$ for
$X_i \in [0, 1]$ and the last inequality holds from $m \ge 4$. Therefore, for any $t > 0$, the following
holds by applying Lemma 6 to $\xi_m(X)$:

$$\Pr[E[V_m(X)] - V_m(X) > t] = \Pr[E[\xi_m(X)] - \xi_m(X) > mt] \le \exp\left( \frac{-mt^2}{2E[V_m(X)]} \right).$$

Setting $\delta = \exp(-mt^2 / 2E[V_m(X)])$ gives

$$\Pr\left[ E[V_m(X)] - V_m(X) > \sqrt{2E[V_m(X)] \ln(1/\delta)/m} \right] \le \delta,$$

which completes the proof of Eqn. (21) by using the square-root's inequality and
$\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$. □

Proof of Theorem 5: For i.i.d. random variables $X = (X_1, X_2, \ldots, X_m)$, we set
$V_m(X) = \sum_{i \ne j} (X_i - X_j)^2 / 2m(m-1)$, and observe that

$$E[V_m(X)] = \frac{1}{2m(m-1)} \sum_{i \ne j} E[(X_i - X_j)^2] = \frac{1}{m(m-1)} \sum_{i \ne j} \left( E[X_i^2] - E[X_i]E[X_j] \right) = V(X_1),$$

where $V(X_1)$ denotes the variance $V(X_1) = E[(X_1 - E[X_1])^2]$. For any $\delta > 0$, the following
holds with probability at least $1 - \delta$ from Eqn. (11),

$$E[X] - \frac{1}{m}\sum_{i=1}^{m} X_i \le \sqrt{\frac{2V(X_1) \ln(1/\delta)}{m}} + \frac{\ln(1/\delta)}{3m} = \sqrt{\frac{2E[V_m(X)] \ln(1/\delta)}{m}} + \frac{\ln(1/\delta)}{3m},$$

which completes the proof of Eqn. (4) by combining with Eqn. (21) in a union bound and simple
calculations. A similar proof can be made for Eqn. (5). □

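6.3. Proof of Theorem 6

Similarly to the proof of Theorem 4, we have

$$\Pr_D[yf(x) < 0] \le \exp(-N\alpha^2/2) + \Pr_{D, Q(f)}[yg(x) < \alpha], \qquad (23)$$

for any given $\alpha > 0$, $f \in C(\mathcal{H})$ and $g \in C_N(\mathcal{H})$ chosen i.i.d. according to
$Q(f)$. Recall that $|C_N(\mathcal{H})| \le |\mathcal{H}|^N$. Therefore, for any $\delta_N > 0$,
combining the union bound with Eqn. (4) in Theorem 5 guarantees that the following holds with
probability at least $1 - \delta_N$ over sample $S$, for any $g \in C_N(\mathcal{H})$ and
$\alpha \in \mathcal{A}$,

$$\Pr_D[yg(x) < \alpha] \le \Pr_S[yg(x) < \alpha] + \sqrt{\frac{2}{m} \hat{V}_m \ln\frac{2|\mathcal{H}|^{N+1}}{\delta_N}} + \frac{7}{3m} \ln\frac{2|\mathcal{H}|^{N+1}}{\delta_N}, \qquad (24)$$

where

$$\hat{V}_m = \sum_{i<j} \frac{(I[y_i g(x_i) < \alpha] - I[y_j g(x_j) < \alpha])^2}{2m(m-1)}.$$

Furthermore, we have

$$\sum_{i<j} (I[y_i g(x_i) < \alpha] - I[y_j g(x_j) < \alpha])^2 = m^2 \Pr_S[yg(x) < \alpha] \Pr_S[yg(x) \ge \alpha],$$

which yields that

$$\hat{V}_m = \frac{m}{2m-2} \Pr_S[yg(x) < \alpha] \Pr_S[yg(x) \ge \alpha] \le \Pr_S[yg(x) < \alpha], \qquad (25)$$

for $m \ge 4$. By using Lemma 1 again, the following holds for any $\theta_1 > 0$,

$$\Pr_{S, Q(f)}[yg(x) < \alpha] \le \exp(-N\theta_1^2/2) + \Pr_S[yf(x) < \alpha + \theta_1]. \qquad (26)$$

Setting $\theta_1 = \alpha = \theta/2$ and combining Eqns. (23), (24), (25) and (26), we have

$$\Pr_D[yf(x) < 0] \le \Pr_S[yf(x) < \theta] + 2\exp(-N\theta^2/8) + \frac{7\mu}{3m} + \sqrt{\frac{2\mu}{m}\left( \Pr_S[yf(x) < \theta] + \exp(-N\theta^2/8) \right)},$$

where $\mu = \ln(2|\mathcal{H}|^{N+1}/\delta_N)$. By utilizing the fact
$\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a \ge 0$ and $b \ge 0$, we further have

$$\Pr_D[yf(x) < 0] \le \Pr_S[yf(x) < \theta] + 2\exp(-N\theta^2/8) + \frac{7\mu}{3m} + \sqrt{\frac{2\mu}{m}\exp(-N\theta^2/8)} + \sqrt{\frac{2\mu}{m}\Pr_S[yf(x) < \theta]}.$$

Finally, we set $\delta_N = \delta/2^N$ so that the probability of failure for any $N$ will be no more
than $\delta$. This theorem follows by setting $N = 8 \ln m / \theta^2$. □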



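6.4. Proof of Corollary 3

If the minimum margin $\theta_1 = \hat{y}_1 f(\hat{x}_1) > 0$, then we have
$\Pr_S[yf(x) < \theta_1] = 0$ and further get

$$\inf_{\theta \in (0,1]} \left[ \Pr_S[yf(x) < \theta] + \frac{7\mu + 3\sqrt{2\mu}}{3m} + \sqrt{\frac{2\mu}{m}\Pr_S[yf(x) < \theta]} \right] \le \Pr_S[yf(x) < \theta_1] + \frac{7\mu_1 + 3\sqrt{2\mu_1}}{3m} + \sqrt{\frac{2\mu_1}{m}\Pr_S[yf(x) < \theta_1]} = \frac{7\mu_1 + 3\sqrt{2\mu_1}}{3m}, \qquad (27)$$

where $\mu_1 = 8 \ln m \ln(2|\mathcal{H}|)/\theta_1^2 + \ln(2|\mathcal{H}|/\delta)$. This gives the proof
of Eqn. (6). If $m \ge 4$, then we have

$$\mu_1 \ge \frac{8}{\theta_1^2} \ln m \ln(2|\mathcal{H}|) \ge 5, \quad \text{leading to} \quad \sqrt{2\mu_1} \le 2\mu_1/3.$$

Therefore, the following holds by combining Eqn. (27) and the above facts.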



— + inf 

m 0G(o,i] 



FAyfix) <e] + !^i±V2^ + J^Fr[yf{x) < 6] 
S 6m \ m S 



2 7fii + 3^2/17 2 6i2 



<— + 
m 



3m 
8 241nm 



< — + 



+ 



241nm 



m m m mt 



H2\n\) + — In 



3 , 2\n\ 



1 



m 



< — + 



m m6 



^ ln(2|?^|) + A In ^ < ii( ln(2m) + In ^ + l) + ^ In ^ 



m 



m 5 



where the last inequality holds from the conditions of Eqn. ([9]) and 8/m < R. This completes 
the proof of Eqn. ([TOD. □ 
6.5. Proof of Theorem

Proof: For any given $\alpha > 0$, $f \in C(\mathcal{H})$ and $g \in C_N(\mathcal{H})$ chosen i.i.d. according to $Q(f)$, it holds from Lemma 1 that

$$\Pr_{D}[yf(x) \ge 0] \le \Pr_{D,\,g\sim Q(f)}[yg(x) \ge -\alpha] + \exp(-N\alpha^2/2),$$

which yields

$$\Pr_{D}[yf(x) < 0] \ge \Pr_{D,\,g\sim Q(f)}[yg(x) < -\alpha] - \exp(-N\alpha^2/2). \tag{28}$$

Recall that $|C_N(\mathcal{H})| \le |\mathcal{H}|^N$. Therefore, for any $\delta_N > 0$, combining the union bound with Eqn. (5) in Theorem 5 guarantees that the following holds with probability at least $1 - \delta_N$ over the sample $S$, for any $g \in C_N(\mathcal{H})$ and $\alpha \in \mathcal{A}$,

$$\Pr_{D}[yg(x) < -\alpha] \ge \Pr_{S}[yg(x) < -\alpha] - \sqrt{\frac{2}{m} V_m \ln\frac{2|\mathcal{H}|^{N+1}}{\delta_N}} - \frac{7}{3m}\ln\frac{2|\mathcal{H}|^{N+1}}{\delta_N}, \tag{29}$$

where

$$V_m = \sum_{i<j} \frac{\big(I[y_i g(x_i) < -\alpha] - I[y_j g(x_j) < -\alpha]\big)^2}{2m(m-1)} \le \Pr_{S}[yg(x) < -\alpha].$$

By using Lemma 1 again, it holds that

$$\Pr_{S}[yg(x) < -\alpha] \le \Pr_{S}[yf(x) < 0] + \exp(-N\alpha^2/2),$$

$$\Pr_{S}[yg(x) < -\alpha] \ge \Pr_{S}[yf(x) < -2\alpha] - \exp(-N\alpha^2/2).$$

Therefore, combining the above inequalities with Eqns. (28) and (29), we have

$$\Pr_{D}[yf(x) < 0] \ge \Pr_{S}[yf(x) < -2\alpha] - 2\exp(-N\alpha^2/2) - \sqrt{\frac{2\Pr_{S}[yf(x) < 0] + 2\exp(-N\alpha^2/2)}{m}\ln\frac{2|\mathcal{H}|^{N+1}}{\delta_N}} - \frac{7}{3m}\ln\frac{2|\mathcal{H}|^{N+1}}{\delta_N}.$$

Set $\theta = 2\alpha$ and $\delta_N = \delta/2^N$ so that the probability of failure for any $N$ will be no more than $\delta$. This theorem follows by using $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ and setting $N = 8\ln m/\theta^2$. □



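6.6. Proof of Theorem 8

Our proof is based on a new Bernstein-type bound as follows:

Lemma 8 For $f \in C(\mathcal{H})$ and $g \in C_N(\mathcal{H})$ chosen i.i.d. according to distribution $Q(f)$, we have

$$\Pr_{S,\,g\sim Q(f)}[yg(x) - yf(x) \ge t] \le \exp\left(\frac{-Nt^2}{2 - 2E_S^2[yf(x)] + 4t/3}\right).$$

Proof: For $\lambda > 0$, we use Markov's inequality to obtain

$$\Pr_{S,\,g\sim Q(f)}[yg(x) - yf(x) \ge t] = \Pr_{S,\,g\sim Q(f)}\big[(yg(x) - yf(x))N\lambda/2 \ge N\lambda t/2\big]$$

$$\le \exp\left(-\frac{\lambda N t}{2}\right) E_{S,\,g\sim Q(f)}\big[\exp\big(\lambda N (yg(x) - yf(x))/2\big)\big]$$

$$= \exp(-\lambda N t/2) \prod_{j=1}^{N} E_{S,\,h_j\sim Q(f)}\big[\exp\big(\lambda(yh_j(x) - yf(x))/2\big)\big],$$

where the last equality holds from $g = \frac{1}{N}\sum_{j=1}^{N} h_j$ and the independence of the $h_j$. Notice that $|yh_j(x) - yf(x)| \le 2$ from $\mathcal{H} \subseteq \{h\colon \mathcal{X} \to \{-1,+1\}\}$. By using Taylor's expansion, we further get

$$E_{S,\,h_j\sim Q(f)}\big[\exp\big(\lambda(yh_j(x) - yf(x))/2\big)\big] \le 1 + E_{S,\,h_j\sim Q(f)}\big[(yh_j(x) - yf(x))^2\big](e^{\lambda} - 1 - \lambda)/4$$

$$= 1 + E_S\big[1 - (yf(x))^2\big](e^{\lambda} - 1 - \lambda)/4 \le \exp\big((1 - E_S^2[yf(x)])(e^{\lambda} - 1 - \lambda)/4\big),$$

where the last inequality holds from Jensen's inequality and $1 + x \le e^x$. Therefore, it holds that

$$\Pr_{S,\,g\sim Q(f)}[yg(x) - yf(x) \ge t] \le \exp\big(N(e^{\lambda} - 1 - \lambda)(1 - E_S^2[yf(x)])/4 - \lambda N t/2\big).$$

If $0 < \lambda < 3$, then we can use Taylor's expansion again to have

$$e^{\lambda} - 1 - \lambda = \sum_{i=2}^{\infty} \frac{\lambda^i}{i!} \le \frac{\lambda^2}{2}\sum_{i=0}^{\infty}\Big(\frac{\lambda}{3}\Big)^i = \frac{\lambda^2}{2(1 - \lambda/3)}.$$

Now by picking $\lambda = t/(1/2 - E_S^2[yf(x)]/2 + t/3)$, we have

$$-\frac{\lambda t}{2} + \frac{\lambda^2(1 - E_S^2[yf(x)])}{8(1 - \lambda/3)} \le \frac{-t^2}{2 - 2E_S^2[yf(x)] + 4t/3},$$

which completes the proof as desired. □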
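The two analytic steps at the end of the proof of Lemma 8 can be sanity-checked numerically. The sketch below is only an illustration: the function names are hypothetical, `mean_margin` stands for $E_S[yf(x)]$, and the grids of test values are arbitrary choices.

```python
import math


def taylor_gap(lam):
    """lam^2 / (2 (1 - lam/3)) minus (e^lam - 1 - lam); nonnegative on (0, 3)."""
    return lam ** 2 / (2 * (1 - lam / 3)) - (math.exp(lam) - 1 - lam)


def chosen_lambda(t, mean_margin):
    """lambda = t / (1/2 - E^2/2 + t/3), with E = E_S[y f(x)]."""
    return t / (0.5 - mean_margin ** 2 / 2 + t / 3)


def exponent(lam, t, mean_margin):
    """-lam*t/2 + lam^2 (1 - E^2) / (8 (1 - lam/3)), i.e. the exponent per unit N."""
    v = 1 - mean_margin ** 2
    return -lam * t / 2 + lam ** 2 * v / (8 * (1 - lam / 3))


def lemma_exponent(t, mean_margin):
    """-t^2 / (2 - 2 E^2 + 4t/3), the exponent per unit N stated in Lemma 8."""
    return -t ** 2 / (2 - 2 * mean_margin ** 2 + 4 * t / 3)


# Step 1: the Taylor-series bound holds across a grid of (0, 3).
print(all(taylor_gap(i / 100) >= -1e-12 for i in range(1, 300)))

# Step 2: with the chosen lambda, the exponent matches the lemma's form.
for t in (0.1, 0.5, 1.0):
    for e in (0.0, 0.3, 0.8):
        lam = chosen_lambda(t, e)
        print(t, e, abs(exponent(lam, t, e) - lemma_exponent(t, e)) < 1e-12)
```

With this choice of $\lambda$ the two exponents coincide, so the final inequality in the proof holds with equality; the condition $\lambda < 3$ is satisfied whenever $E_S^2[yf(x)] < 1$.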

Proof of Theorem 8: This proof is rather similar to the proof of Theorem 6, and we just give the main steps. For any $\alpha > 0$ and $\delta_N > 0$, the following holds with probability at least $1 - \delta_N$ over the sample $S$ ($m \ge 4$),

$$\Pr_{D}[yf(x) < 0] \le \Pr_{S}[yg(x) < \alpha] + \exp\left(-\frac{N\alpha^2}{2}\right) + \sqrt{\frac{2 V_m^{*}}{m}\ln\frac{2|\mathcal{H}|^{N+1}}{\delta_N}} + \frac{7}{3m}\ln\frac{2|\mathcal{H}|^{N+1}}{\delta_N},$$

where $V_m^{*} = \Pr_{S}[yg(x) < \alpha]\Pr_{S}[yg(x) \ge \alpha]$. For any $\theta_1 > 0$, we use Lemma 1 to obtain

$$V_m^{*} = \Pr_{S}[yg(x) < \alpha]\Pr_{S}[yg(x) \ge \alpha] \le 3\exp(-N\theta_1^2/2) + \Pr_{S}[yf(x) < \alpha + \theta_1]\Pr_{S}[yf(x) > \alpha - \theta_1].$$

From Lemma 8 it holds that

$$\Pr_{S}[yg(x) < \alpha] \le \Pr_{S}[yf(x) < \alpha + \theta_1] + \exp\left(\frac{-N\theta_1^2}{2 - 2E_S^2[yf(x)] + 4\theta_1/3}\right).$$

Let $\theta_1 = \theta/6$, $\alpha = 5\theta/6$, and set $\delta_N = \delta/2^N$ so that the probability of failure for any $N$ will be no more than $\delta$. We complete the proof by setting $N = 144\ln m/\theta^2$ and simple calculation. □



7. Empirical Verifications 



Though this paper mainly focuses on the theoretical explanation of AdaBoost, we also present empirical studies that compare the performance of AdaBoost and arc-gv so as to verify our theory.

We conduct our experiments on 51 benchmark datasets from the UCI repository [2], which show considerable diversity in size, number of classes, and number and types of attributes. The detailed characteristics are summarized in Table 2, and most of the datasets have been investigated by previous researchers. We transform each multi-class dataset into a two-class dataset by regarding the union of half of the classes as one meta-class and the union of the other half as another meta-class; the partition is selected so that the two meta-classes have similar sizes. To control the complexity of the base learners, we use decision stumps as the base learners for both AdaBoost and arc-gv. On each dataset we run 10 trials of 10-fold cross validation, and the detailed results are summarized in Table 1.
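The multi-class to two-class transformation described above can be sketched as follows. This is a minimal illustration under our own assumptions: the paper does not specify how the partition is searched, so the exhaustive search here (practical only for a small number of classes), the function names, and the example class sizes are all hypothetical.

```python
from itertools import combinations


def split_into_meta_classes(class_sizes):
    """Choose half of the classes so that the two meta-classes have similar sizes.

    class_sizes maps each class label to its number of instances.
    Exhaustive search over all half-sized subsets (an assumption, not
    necessarily the procedure used in the experiments).
    """
    labels = sorted(class_sizes)
    total = sum(class_sizes.values())
    half = len(labels) // 2

    def imbalance(group):
        # Absolute size difference between this group and the remaining classes.
        size = sum(class_sizes[c] for c in group)
        return abs(total - 2 * size)

    first = set(min(combinations(labels, half), key=imbalance))
    second = set(labels) - first
    return first, second


def relabel(y, first):
    """Map the original labels to +1 (first meta-class) or -1 (second meta-class)."""
    return [1 if label in first else -1 for label in y]


# Hypothetical class sizes for a four-class dataset.
sizes = {"a": 50, "b": 30, "c": 20, "d": 10}
first, second = split_into_meta_classes(sizes)
print(sorted(first), sorted(second))
print(relabel(["a", "b", "c", "d"], first))
```

For datasets with many classes the number of half-sized subsets grows quickly, so a greedy or randomized search would be a more practical substitute.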



As shown by previous empirical work [10, 34], we can see clearly from Table 1 that AdaBoost performs better than arc-gv, which also verifies our Corollary 3. On the other hand, it is noteworthy that AdaBoost does not uniformly outperform arc-gv, since the performances of the two algorithms are comparable on many datasets. This is because the bound of Theorem 6 and the minimum margin bound of Theorem 2 are both $O(\ln m/m)$, though the former has smaller coefficients.
8. Conclusion

The margin theory provides one of the most intuitive and popular theoretical explanations of AdaBoost. It is well accepted that the margin distribution is crucial for characterizing the performance of AdaBoost, and it is desirable to theoretically establish generalization bounds based on the margin distribution.

In this paper, we show that previous margin bounds, such as the minimum margin bound and the Emargin bound, are all single-margin bounds that do not really depend on the whole margin distribution. Then, we slightly improve the empirical Bernstein bound using different techniques. As our main result, we prove a new generalization bound which considers exactly the same factors as Schapire et al. [35] but is uniformly tighter than the bounds of Schapire et al. [35] and Breiman [10], and thus provides a complete answer to Breiman's doubt about the margin theory. By incorporating other factors such as average margin and variance, we prove another upper bound which is heavily related to the whole margin distribution. Our empirical evidence shows that AdaBoost performs better than arc-gv but does not uniformly outperform it, which further confirms our theory.
Table 1: Accuracy (mean±std.) comparisons of AdaBoost and arc-gv on 51 benchmark datasets. The better performance (paired t-test at 95% significance level) is bold. The last line shows the win/tie/loss counts of AdaBoost versus arc-gv.

| Dataset | Test error (AdaBoost) | Test error (Arc-gv) |
|---|---|---|
| anneal | 0.0047±0.0066 | 0.0043±0.0067 |
| abalone | 0.2203±0.0208 | 0.2186±0.0224 |
| artificial | 0.3351±0.0197 | 0.2666±0.0200 |
| auto-m | 0.1143±0.0471 | 0.1085±0.0436 |
| auto | 0.0991±0.0670 | 0.0996±0.0667 |
| balance | 0.0088±0.0119 | 0.0093±0.0120 |
| breast-w | 0.0411±0.0221 | 0.0413±0.0242 |
| car | 0.0502±0.0154 | 0.0509±0.0168 |
| cmc | 0.2787±0.0288 | 0.2872±0.0311 |
| colic | 0.1905±0.0661 | 0.1935±0.0683 |
| credit-a | 0.1368±0.0410 | 0.1622±0.0405 |
| cylinder | 0.2076±0.0509 | 0.2070±0.0570 |
| diabetes | 0.2409±0.0423 | 0.2551±0.0440 |
| german | 0.2486±0.0372 | 0.2717±0.0403 |
| glass | 0.2045±0.0794 | 0.2113±0.0848 |
| heart-c | 0.1960±0.0701 | 0.2161±0.0754 |
| heart-h | 0.1892±0.0623 | 0.2006±0.0673 |
| hepatitis | 0.1715±0.0821 | 0.1798±0.0848 |
| house-v | 0.0471±0.0333 | 0.0471±0.0326 |
| hypo | 0.0053±0.0035 | 0.0054±0.0034 |
| ion | 0.0721±0.0432 | 0.0707±0.0421 |
| iris | 0.0000±0.0000 | 0.0000±0.0000 |
| isolet | 0.1270±0.0113 | 0.1214±0.0116 |
| kr-vs-kp | 0.0354±0.0106 | 0.0326±0.0097 |
| letter | 0.1851±0.0076 | 0.1778±0.0077 |
| lymph | 0.1670±0.0971 | 0.1690±0.0972 |
| magic04 | 0.1555±0.0078 | 0.1578±0.0077 |
| mfeat-f | 0.0445±0.0136 | 0.0471±0.0143 |
| mfeat-m | 0.0990±0.0190 | 0.1048±0.0200 |
| mush | 0.0000±0.0000 | 0.0000±0.0000 |
| musk | 0.0916±0.0413 | 0.0926±0.0437 |
| nursery | 0.0002±0.0004 | 0.0002±0.0004 |
| optdigits | 0.1060±0.0144 | 0.1048±0.0129 |
| page-b | 0.0331±0.0068 | 0.0325±0.0062 |
| pendigits | 0.0796±0.0083 | 0.0788±0.0081 |
| satimage | 0.0565±0.0083 | 0.0531±0.0080 |
| segment | 0.0171±0.0083 | 0.0159±0.0083 |
| shuttle | 0.0010±0.0001 | 0.0009±0.0001 |
| sick | 0.0250±0.0082 | 0.0246±0.0079 |
| solar-f | 0.0440±0.0171 | 0.0490±0.0182 |
| sonar | 0.1441±0.0697 | 0.1863±0.0881 |
| soybean | 0.0245±0.0188 | 0.0242±0.0174 |
| spamb | 0.0570±0.0107 | 0.0553±0.0105 |
| spect | 0.1256±0.0386 | 0.1250±0.0414 |
| splice | 0.0561±0.0128 | 0.0605±0.0131 |
| tic-tac-t | 0.0172±0.0115 | 0.0177±0.0116 |
| vehicle | 0.0435±0.0215 | 0.0447±0.0231 |
| vote | 0.0471±0.0333 | 0.0471±0.0326 |
| vowel | 0.1114±0.0276 | 0.1026±0.0278 |
| wavef | 0.1145±0.0136 | 0.1181±0.0141 |
| yeast | 0.2677±0.0344 | 0.2841±0.0332 |
Table 2: Description of datasets: the number of instances (#inst), the number of classes (#class), and the numbers of continuous (#CF) and discrete (#DF) features.

| dataset | #inst | #class | #CF | #DF |
|---|---|---|---|---|
| abalone | 4177 | 29 | 7 | 1 |
| anneal | 898 | 6 | 6 | 32 |
| artificial | 5109 | 10 | 7 | |
| auto-m | 398 | 5 | 2 | 4 |
| auto | 205 | 6 | 15 | 10 |
| balance | 540 | 18 | 21 | 2 |
| breast-w | 699 | 2 | 9 | |
| car | 1728 | 4 | | 6 |
| cmc | 1473 | 3 | 2 | 7 |
| colic | 368 | 2 | 10 | 12 |
| credit-a | 690 | 2 | 6 | 9 |
| cylinder | 540 | 2 | 18 | 21 |
| diabetes | 768 | 2 | 8 | |
| german | 1000 | 2 | 7 | 13 |
| glass | 214 | 6 | 9 | |
| heart-c | 303 | 2 | 6 | 7 |
| heart-h | 294 | 2 | 6 | 7 |
| hepatitis | 155 | 2 | 6 | 13 |
| house-v | 435 | 2 | | 16 |
| hypo | 3772 | 4 | 7 | 22 |
| ion | 351 | 2 | 34 | |
| iris | 150 | 3 | 4 | |
| isolet | 7797 | 26 | 617 | |
| kr-vs-kp | 3169 | 2 | | 36 |
| letter | 20000 | 26 | 16 | |
| lymph | 148 | 4 | | 18 |
| magic04 | 19020 | 2 | 10 | |
| mfeat-f | 2000 | 10 | 216 | |
| mfeat-m | 2000 | 10 | 6 | |
| mush | 8124 | 2 | | 22 |
| musk | 476 | 2 | 166 | |
| nursery | 12960 | 2 | 9 | |
| optdigits | 5620 | 10 | 64 | |
| page-b | 5473 | 5 | 10 | |
| pendigits | 10992 | 2 | 16 | |
| satimage | 6453 | 7 | 36 | |
| segment | 2310 | 7 | 19 | |
| shuttle | 58000 | 7 | 9 | |
| sick | 3372 | 2 | 7 | 22 |
| solar-f | 1066 | 6 | | 12 |
| sonar | 208 | 2 | 60 | |
| soybean | 683 | 19 | | 35 |
| spamb | 4601 | 2 | 57 | |
| spect | 531 | 48 | 100 | 2 |
| splice | 3190 | 3 | | 60 |
| tic-tac-t | 958 | 2 | | 9 |
| vehicle | 846 | 4 | 18 | |
| vote | 435 | 2 | | 16 |
| vowel | 990 | 11 | | 11 |
| wavef | 5000 | 3 | 40 | |
| yeast | 1484 | 10 | 8 | |